Extracts BitString from code point to BitString class #723

Draft
mward-sudo wants to merge 8 commits into bartblast:dev from mward-sudo:02-19-extracts_bitstring_from_code_point_to_bitstring_class
Conversation

@mward-sudo
Contributor

@mward-sudo mward-sudo commented Feb 19, 2026

Closes #720

Dependencies

This PR includes commits from the PR(s) it depends on. Once those PR(s) are merged into the dev branch, this PR will be rebased so that it contains only its own commits. It will remain in draft until then.

Summary by CodeRabbit

  • New Features

    • Added UTF-8 character encoding and decoding capabilities with comprehensive validation and handling.
  • Refactor

    • Streamlined UTF-8 processing with centralized utility functions, improving code maintainability and consistency.
  • Tests

    • Added extensive test coverage for UTF-8 operations, character encoding, decoding, and validation scenarios.

@coderabbitai
coderabbitai bot commented Feb 19, 2026

📝 Walkthrough

Added UTF-8 utilities to the Bitstring class for decoding/encoding code points and validating UTF-8 sequences. Refactored unicode.mjs to use these new Bitstring helpers instead of custom UTF-8 validation and codepoint extraction logic. Added comprehensive test coverage for the new utilities.

Changes

| Cohort / File(s) | Summary |
|---|---|
| UTF-8 Utilities in Bitstring (assets/js/bitstring.mjs) | Added 8 new static UTF-8 helper methods: decodeUtf8CodePoint, fromCodepoint, toCodepointArray, getValidUtf8Length, isValidUtf8CodePoint, isValidUtf8ContinuationByte, isValidUtf8Sequence, isTruncatedUtf8Sequence. These centralize UTF-8 encoding/decoding and validation logic for use across the codebase. |
| Unicode Module Refactoring (assets/js/erlang/unicode.mjs) | Replaced low-level bitstring construction and custom UTF-8 validation with calls to new Bitstring utilities. Updated characters_to_binary/3, handleInvalidUtf8FromBinary, handleInvalidUtf8FromList, and other functions to use Bitstring.fromCodepoint, Bitstring.toCodepointArray, Bitstring.getValidUtf8Length, and Bitstring.isTruncatedUtf8Sequence instead of manual implementations. |
| UTF-8 Utilities Test Coverage (test/javascript/bitstring_test.mjs) | Added comprehensive test suite for new UTF-8 utilities covering decodeUtf8CodePoint, fromCodepoint, toCodepointArray, getValidUtf8Length, isTruncatedUtf8Sequence, isValidUtf8CodePoint, isValidUtf8ContinuationByte, and isValidUtf8Sequence with various edge cases and byte sequences. |
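
For readers unfamiliar with the encoding side of these helpers, here is a minimal sketch of turning a single code point into UTF-8 bytes. The function name `encodeUtf8CodePoint`, its `Uint8Array` return type, and the `RangeError` on invalid input are assumptions made for illustration; the PR's `Bitstring.fromCodepoint` may differ in signature, return type, and error handling.

```javascript
// Hypothetical standalone helper, not the PR's implementation.
function encodeUtf8CodePoint(codePoint) {
  // Assumption: reject anything that is not a Unicode scalar value
  // (out of range, non-integer, or a UTF-16 surrogate).
  if (
    !Number.isInteger(codePoint) ||
    codePoint < 0 ||
    codePoint > 0x10ffff ||
    (codePoint >= 0xd800 && codePoint <= 0xdfff)
  ) {
    throw new RangeError(`Not a Unicode scalar value: ${codePoint}`);
  }

  // 1 byte: 0xxxxxxx
  if (codePoint < 0x80) return Uint8Array.of(codePoint);

  // 2 bytes: 110xxxxx 10xxxxxx
  if (codePoint < 0x800) {
    return Uint8Array.of(0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f));
  }

  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  if (codePoint < 0x10000) {
    return Uint8Array.of(
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    );
  }

  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return Uint8Array.of(
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  );
}

// Cross-check against the platform encoder for U+20AC (euro sign).
const expected = new TextEncoder().encode("€");
const actual = encodeUtf8CodePoint(0x20ac);
console.log(
  actual.length === expected.length &&
    actual.every((byte, i) => byte === expected[i]),
); // true
```

Comparing against `TextEncoder` is a cheap way to sanity-check a hand-written encoder across all four sequence lengths.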

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • bartblast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately reflects the main objective of moving BitString code point conversion utilities into the BitString class, matching the core intent of issue #720. |
| Linked Issues check | ✅ Passed | The PR successfully implements the requirement from issue #720 by adding toCodepointArray(bitstring) as a static method and providing comprehensive UTF-8 conversion utilities in the BitString class. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to extracting BitString/code point utilities into the BitString class. The refactoring in unicode.mjs uses the new utilities as intended, with no extraneous modifications detected. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
assets/js/bitstring.mjs (1)

259-259: Allocating lookup-table objects per call in hot-path methods.

Both decodeUtf8CodePoint (line 259) and isValidUtf8CodePoint (line 630) recreate a small object literal on every invocation. These are called in the O(n) scan inside getValidUtf8Length, once per multi-byte sequence.

The firstByteMasks values follow the pattern 0x7f >> length, which avoids the allocation entirely, and minValueForLength can be a module-level constant array.

♻️ Zero-allocation alternatives
  static decodeUtf8CodePoint(bytes, start, length) {
    if (length === 1) return bytes[start];

-   // First byte masks: 2-byte=0x1f, 3-byte=0x0f, 4-byte=0x07
-   const firstByteMasks = {2: 0x1f, 3: 0x0f, 4: 0x07};
-
-   let codePoint = bytes[start] & firstByteMasks[length];
+   // First byte masks: 2→0x1f, 3→0x0f, 4→0x07 (formula: 0x7f >> length)
+   let codePoint = bytes[start] & (0x7f >> length);

    for (let i = 1; i < length; i++) {

For isValidUtf8CodePoint, hoist minValueForLength to a module-level (or class-level) constant:

+// Module-level constant; indices 0–4, only 1–4 used.
+const UTF8_MIN_CODE_POINT = [0, 0, 0x80, 0x800, 0x10000];

  static isValidUtf8CodePoint(codePoint, encodingLength) {
-   const minValueForLength = {1: 0, 2: 0x80, 3: 0x800, 4: 0x10000};
-   if (codePoint < minValueForLength[encodingLength]) return false;
+   if (codePoint < UTF8_MIN_CODE_POINT[encodingLength]) return false;

Also applies to: 630-630
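
To make the suggestion easier to check, the sketch below puts both zero-allocation variants into standalone functions. Only the `0x7f >> length` mask and the `UTF8_MIN_CODE_POINT` lookup come from the diff above; the loop body of `decodeUtf8CodePoint` and the surrogate and upper-bound checks in `isValidUtf8CodePoint` are assumptions about what the rest of each method might do, not the PR's actual code.

```javascript
// Module-level constant; indices 0-4, only 1-4 used.
const UTF8_MIN_CODE_POINT = [0, 0, 0x80, 0x800, 0x10000];

// Standalone stand-in for the static method (illustration only).
function decodeUtf8CodePoint(bytes, start, length) {
  if (length === 1) return bytes[start];

  // 0x7f >> length yields 0x1f, 0x0f, 0x07 for lengths 2, 3, 4.
  let codePoint = bytes[start] & (0x7f >> length);

  for (let i = 1; i < length; i++) {
    // Assumed loop body: each continuation byte adds its low 6 bits.
    codePoint = (codePoint << 6) | (bytes[start + i] & 0x3f);
  }

  return codePoint;
}

function isValidUtf8CodePoint(codePoint, encodingLength) {
  // Overlong encodings decode to a value below the minimum for their length.
  if (codePoint < UTF8_MIN_CODE_POINT[encodingLength]) return false;

  // Assumed additional checks: no surrogates, nothing above U+10FFFF.
  if (codePoint >= 0xd800 && codePoint <= 0xdfff) return false;
  return codePoint <= 0x10ffff;
}

// The computed mask matches the original lookup table for every length.
const originalMasks = { 2: 0x1f, 3: 0x0f, 4: 0x07 };
for (const length of [2, 3, 4]) {
  console.log(length, (0x7f >> length) === originalMasks[length]); // true each time
}

const euro = new TextEncoder().encode("€"); // bytes e2 82 ac
console.log(decodeUtf8CodePoint(euro, 0, 3).toString(16)); // 20ac
```

An array indexed by encoding length keeps the lookup branch-free; a `switch` would avoid the unused slots at indices 0 and 1 at the cost of a few more lines.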

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/js/bitstring.mjs` at line 259, The code repeatedly allocates small
lookup objects inside hot-path functions decodeUtf8CodePoint and
isValidUtf8CodePoint (e.g. firstByteMasks and minValueForLength); hoist these to
module-level constants and replace the firstByteMasks literal with the computed
pattern (use 0x7f >> length) or a precomputed array to avoid per-call object
creation, and make minValueForLength a shared constant array used by both
functions.
Development

Successfully merging this pull request may close these issues.

Extract conversion of bitstring to code point array to Bitstring class

1 participant